Accelerated Training for Large Feedforward Neural Networks


Slawomir W. Stepniewski and Charles C. Jorgensen 


Summary 

In this paper we introduce a new training algorithm, the 
scaled variable metric method. Our approach attempts to 
increase the convergence rate of the modified variable 
metric method. It is also combined with the RBackprop 
algorithm, which computes the product of the matrix of 
second derivatives (Hessian) with an arbitrary vector. 

The RBackprop method allows us to avoid 
computationally expensive, direct line searches. In 
addition, it can be utilized in the new, “predictive” 
updating technique of the inverse Hessian 
approximation. We have used directional slope testing to 
adjust the step size and found that this strategy works 
exceptionally well in conjunction with the RBackprop 
algorithm. Some supplementary but nevertheless
important enhancements to the basic training scheme,
such as improved adjustment of a scaling factor for the
variable metric update and a computationally more
efficient procedure for updating the inverse Hessian
approximation, are presented as well. We summarize by
comparing the scaled variable metric method with four 
first- and second-order optimization algorithms, 
including a very effective implementation of the 
Levenberg-Marquardt method. Our tests indicate 
promising computational speed gains of the new training 
technique, particularly for large feedforward networks, 
i.e., for problems where the training process may be the 
most laborious. 

1. Introduction 

For some neural network applications requiring high 
modeling/mapping accuracy, it may not be sufficient to 
employ first order training methods based on the 
gradient descent schemes with adjustable learning rates. 
Although these training algorithms are relatively 
inexpensive computationally, they could perform poorly, 
because the search directions can partially overlap and 
interfere with each other producing the undesirable 
effect of impairing previous minimization efforts during 
subsequent iterations (refs. 1 & 2). Moreover, for 
problems with rapid changes of the objective function, 
small variations in the step sizes may result in 
considerably different gradient directions and even entire 
search paths. 


In optimization theory, several solutions have been 
proposed to boost the effectiveness of consecutive 
directional searches (ref. 1). Conjugate gradient methods 
attempt to construct non-interfering directions based on a 
steady quadratic model of the objective function and the 
assumption of exact line searches along those search 
directions. The more effective Newton and trust-region 
methods rely on a more detailed quadratic model of the 
merit function rederived at each iteration. A serious 
drawback of these algorithms is the significant 
computational overhead of obtaining the matrix of 
second partial derivatives (Hessian) or its approximation. 

Quasi-Newton (secant) training methods also utilize a 
Hessian approximation or its inverse. The computational 
efficiency of quasi-Newton methods comes from the fact 
that a Hessian approximation is continuously built 
during function minimization. The updating process is 
substantially faster than computing a complete Hessian 
matrix. However, the lack of precise knowledge of 
second derivatives may have a negative impact on the 
training convergence rate. In practice, the algorithm may 
also be more susceptible to local minima and round-off 
errors. It may be advantageous, nevertheless, to use 
quasi-Newton methods for problems where other second 
order algorithms are computationally too expensive and 
gradient descent methods produce unsatisfactory results. 
In this paper, we present a new variation of one
quasi-Newton method, the scaled variable metric (SVM)
method. The method appears to be quite competitive
with other leading training techniques. Our tests show 
that for large neural networks having more than several 
hundred weights, the SVM technique typically 
outperforms standard variable metric algorithms in 
convergence speed and in some cases it is also able to 
produce the most accurate neural models. 

2. Quasi-Newton Methods 

Although quasi-Newton optimization algorithms are
most commonly understood as techniques to construct
successive Hessian approximations or their inverses,
many modern approaches view them primarily as
strategies to choose a series of search directions. These
methods put less emphasis on the issue of convergence
to the true Hessian. The majority of quasi-Newton
optimization algorithms utilize formulae that build
inverse Hessian approximations (denoted as matrix D).
Methods constructing direct Hessian approximations
(refs. 3 & 4) are used less frequently. Our algorithm
belongs to the group of variable metric methods, an
important subclass of quasi-Newton algorithms that
ensure positive definiteness of D. The Huang formula
(refs. 4 & 5) defined by
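$$
D_{k+1} = D_k + \Delta D_k = D_k
+ \rho\,\frac{\Delta w_k \left(K_1 \Delta w_k + K_2 D_k^T \Delta g_k\right)^T}
             {\left(K_1 \Delta w_k + K_2 D_k^T \Delta g_k\right)^T \Delta g_k}
- \frac{D_k \Delta g_k \left(L_1 \Delta w_k + L_2 D_k^T \Delta g_k\right)^T}
       {\left(L_1 \Delta w_k + L_2 D_k^T \Delta g_k\right)^T \Delta g_k}
\tag{1}
$$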


is a fairly general update for which the well-known 
BFGS (Broyden-Fletcher-Goldfarb-Shanno) and DFP 
(Davidon-Fletcher-Powell) formulae are special cases. 

In (1), $\Delta w_k = w_{k+1} - w_k$ is the vector of weight
corrections and $\Delta g_k$ denotes the corresponding gradient
change. The $\rho$ parameter must be positive to preserve the
positive definite property of $D_k$. The four other scalars
$K_1$, $K_2$, $L_1$, and $L_2$ can be chosen arbitrarily, except
that $L_1 = 0$ and $L_2 = 0$ must not hold at the same time. Note
that (1) allows $D_k$ to be unsymmetric. When updating $D_k$ using
(1), at every iteration the following condition is satisfied

$$
D_{k+1} \Delta g_k = \rho\, \Delta w_k \tag{2}
$$

An interesting property of the Huang update is that for a
strictly quadratic error function $E(w)$ with a positive
definite Hessian $H = \partial^2 E / \partial w^2$ and the initial
matrix $D_0$ chosen so that $(D_0 + D_0^T)/2$ is also positive
definite, the sequence $D_k \to \rho H^{-1}$ when $\rho$ is constant
(refs. 5 & 4). In practice, most variable metric algorithms which
utilize (1) with $\rho \neq 1$ do not preserve a constant $\rho$ but
rather attempt to tune it on-line. Nevertheless, convergence to
$\rho H^{-1}$ is an appealing property and limiting frequent and
large changes of this scaling factor may be beneficial.

The Huang family of formulae offers an infinite number
of choices for its adjustable parameters with minimal
theoretical background on how best to set them.
In fact, for the strictly quadratic error function
and exact line searches, the sequence of search directions
is independent of the particular choice of $K_1$, $K_2$, $L_1$,
$L_2$, and constant $\rho$ (ref. 4). However, for non-quadratic
functions and inexact line searches, different updating
formulas derived from (1) are not equivalent. In this
paper we consider the following choice for the $K_i$ and $L_i$
($i = 1, 2$) parameters


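$$
\frac{K_1}{K_2} = -\left(\rho + \frac{\Delta g_k^T D_k \Delta g_k}{\Delta w_k^T \Delta g_k}\right),
\qquad L_1 \neq 0, \quad L_2 = 0
\tag{3}
$$
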
Substituting (3) in (1) leads to the expression

$$
D_{k+1} = D_k + \Delta D_k = D_k
+ \frac{\rho}{b_k}\,\Delta w_k \Delta w_k^T
- \frac{1}{a_k}\,(D_k \Delta g_k)(D_k \Delta g_k)^T
+ a_k r_k r_k^T
\tag{4}
$$

which produces symmetric updates. In equation (4),
$a_k = \Delta g_k^T D_k \Delta g_k$, $b_k = \Delta w_k^T \Delta g_k$, and
$r_k = \Delta w_k / b_k - D_k \Delta g_k / a_k$ are introduced for
notational convenience. Because (4) is very similar to the BFGS
formula except for the $\rho$ scalar, this equation will be
referred to as the extended or modified BFGS formula. Various
authors have tried to find the optimal setting for the $\rho$
parameter (see ref. 3 for a review). It is the paradigm proposed
and studied by Oren (ref. 6) which, in our view, offers an
elegant mathematical justification of how to assess the
useful range of $\rho$ values.
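
The extended BFGS formula (4) can be sketched in a few lines of code. The following is only an illustrative sketch using plain Python lists; the function name `extended_bfgs_update` and the sample vectors are assumptions made for the example, not part of the original algorithm description. The final lines check numerically that the updated matrix satisfies condition (2).

```python
def extended_bfgs_update(D, dw, dg, rho=1.0):
    """Return D_{k+1} computed from equation (4); D is a list of rows."""
    n = len(dw)
    Dg = [sum(D[i][j] * dg[j] for j in range(n)) for i in range(n)]  # D_k * dg_k
    a = sum(dg[i] * Dg[i] for i in range(n))   # a_k = dg_k' D_k dg_k
    b = sum(dw[i] * dg[i] for i in range(n))   # b_k = dw_k' dg_k
    r = [dw[i] / b - Dg[i] / a for i in range(n)]  # r_k
    return [[D[i][j]
             + rho / b * dw[i] * dw[j]
             - Dg[i] * Dg[j] / a
             + a * r[i] * r[j]
             for j in range(n)] for i in range(n)]

# Hypothetical sample data with b_k > 0.
D0 = [[1.0, 0.0], [0.0, 1.0]]
dw = [0.5, -0.25]
dg = [1.0, 0.5]
rho = 2.0

D1 = extended_bfgs_update(D0, dw, dg, rho)
# Left-hand side of condition (2): D_{k+1} * dg_k, which should equal rho * dw_k.
lhs = [sum(D1[i][j] * dg[j] for j in range(2)) for i in range(2)]
```

Setting `rho=1.0` in this sketch gives the standard BFGS update of the inverse Hessian approximation.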

3. Convergence Acceleration 

The idea that the convergence rate of the variable metric
methods could be controlled was originally suggested by
Oren and Luenberger in the context of the self-scaling
variable metric method (ref. 6). The concept is based on
a theorem that considers the positive definite
quadratic form

$$
F(x) = \frac{1}{2}(x - x^*)^T H (x - x^*) \tag{5}
$$

with respect to $(x - x^*)$ and an abstract optimization
algorithm which aims to find the stationary point $x^*$.

It is assumed that the minimization method uses exact
line searches along the direction $s_k = -D_k g_k$, where $D_k$ is
assumed to be an arbitrary positive definite matrix and $g_k$
is the gradient of (5) evaluated at $x_k$. For any starting
point, the convergence rate of such an algorithm is
bounded by the inequality (ref. 6)

$$
F(x_{k+1}) \le \left(\frac{\kappa(M_k) - 1}{\kappa(M_k) + 1}\right)^2 F(x_k) \tag{6}
$$

where $\kappa(M_k) \ge 1$ is the condition number defined as the
ratio of the largest eigenvalue of $M_k$ to the smallest one.
The matrix $M_k$ is given by the formula

$$
M_k = H^{1/2} D_k H^{1/2} \tag{7}
$$

Clearly, the fastest convergence can be obtained for
Newton type algorithms when $D_k = H^{-1}$. Then
$\kappa(M_k) = \kappa(I) = 1$ and the minimum is reached in only
one step. The performance of the simple steepest descent
method with exact line searches ($D_k = I$) depends heavily
on the type of the objective function characterized by the
configuration of the eigenvalues of $H$ ($\kappa(M_k) = \kappa(H)$).
When these eigenvalues are significantly different, $F(x)$ forms
a narrow “valley” and we may anticipate a poor
convergence rate. Other search strategies, such as
traditional variable metric methods, can increase (or
decrease) the convergence rate through the $D_k$ matrix.

It is interesting to note that for a badly selected $D_k$ it is
possible that $\kappa(M_k) \gg \kappa(H)$ and our abstract
optimization algorithm may perform worse than the
steepest descent method.
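
To give a feel for inequality (6), the short sketch below evaluates the per-iteration reduction factor for a few condition numbers. The function name `bound_factor` and the sample values are assumptions chosen for illustration only.

```python
def bound_factor(kappa):
    """Worst-case ratio F(x_{k+1}) / F(x_k) from inequality (6)."""
    return ((kappa - 1.0) / (kappa + 1.0)) ** 2

# A perfectly conditioned M_k (kappa = 1) gives a factor of zero, i.e.
# convergence in one step; a large kappa gives a factor close to one.
factors = {kappa: bound_factor(kappa) for kappa in (1.0, 3.0, 100.0)}
```

The factor approaches one quickly as the condition number grows, which is why keeping the eigenvalues of $M_k$ clustered matters for the convergence rate.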

Figure 1 illustrates changes in the location of the $M_k$
eigenvalues after subtracting $(D_k \Delta g_k)(D_k \Delta g_k)^T / a_k$
in (4), then adding the $\rho\, \Delta w_k \Delta w_k^T / b_k$ and
$a_k r_k r_k^T$ terms, respectively. All initial eigenvalues
$\mu_i$ ($i = 1, \ldots, 5$) are either smaller or larger than
$\rho = 2$ in the example. It is easy to see from figure 1 that
each extended BFGS update (4) tends to move all eigenvalues of
$M_k$ but the smallest one closer to each other. The parameter
$\rho$ determines the value of the smallest eigenvalue. Figure 1
shows that when the $\rho$ scalar is constant, it behaves as an
“attractor” for other eigenvalues that will tend
monotonically (in a weak sense) to $\rho$ in the subsequent
iterations. In the DFP or BFGS formulae $\rho = 1$; if this
setting increases the condition number $\kappa(M_k)$, then the
convergence of the training method may suffer. One can
easily construct examples in which all eigenvalues of the
initial matrix $M_0 = H^{1/2} D_0 H^{1/2} = H$ (assuming $D_0 = I$)
are located away from one. Then, assigning $\rho = 1$ in update
(4) may cause convergence deterioration. It may then
take a considerable number of iterations for the $M_k$
eigenvalues to gravitate to each other, so that $\kappa(M_k)$
decreases and the optimization algorithm recovers its
speed.

Figure 1. Plots tracking eigenvalue changes of the matrix $M_k$ due to the repeatedly applied update (4). The “movement”
of the smallest eigenvalue of $M_k$ is marked with the white squares.

4. Scaling Factor

In practice, the gain in performance from the correct
scaling factor $\rho_k$ is accumulated throughout several
iterations. This makes it difficult to adjust $\rho_k$ in an
on-line fashion by observing the effects of the error
reduction in a short horizon. A proper setting of the $\rho_k$
parameter in (4), so that the condition number $\kappa(M_k)$ will
not increase, requires certain information about the $M_k$
eigenvalues. Since the elements of $M_k$ are not known,
similarly to (ref. 6) we use the Rayleigh quotient $R(\cdot)$
to estimate the spread of eigenvalues. For a real and
symmetric matrix $M_k$, the Rayleigh quotient is defined
by $R_0(x) = x^T M_k x / x^T x$. The lower bound for the
largest eigenvalue of $M_k$ is given by

$$
\max_i(\mu_i) \ge R_0(x), \quad x \neq 0 \tag{8}
$$

Similarly, we may try to estimate the upper bound of the
smallest eigenvalue of $M_k$ using $R_1(x) = x^T M_k^{-1} x / x^T x$

$$
\min_i(\mu_i) \le R_1^{-1}(x), \quad x \neq 0 \tag{9}
$$

Since $D_k$ is positive definite and $M_k = H^{1/2} D_k H^{1/2}$,
it follows that $\mu_i > 0$. This allows us to omit the absolute
value operators in (8) and (9). $R_0(x)$ and $R_1^{-1}(x)$ are equal
to the largest and smallest eigenvalues respectively,
when $x$ is the associated eigenvector; in other cases
these estimators are less accurate. For certain vectors
$x$ the Rayleigh quotients can be computed rather
inexpensively

$$
R_0(H^{1/2} \Delta w_k) = \frac{\Delta g_k^T D_k \Delta g_k}{\Delta w_k^T \Delta g_k} = \frac{a_k}{b_k} \tag{10}
$$

$$
R_1^{-1}(H^{-1/2} \Delta g_k) = \frac{\Delta g_k^T \Delta w_k}{\Delta w_k^T D_k^{-1} \Delta w_k} = \frac{b_k}{c_k} \tag{11}
$$

The scalar $c_k = \Delta w_k^T D_k^{-1} \Delta w_k$ in (11) can be calculated
without inverting $D_k$. Taking advantage of the
relationship $\Delta w_k = -\alpha_k D_k g_k$ used to compute weight
corrections (section 5) we can write

$$
c_k = \Delta w_k^T D_k^{-1} \Delta w_k = -\alpha_k \Delta w_k^T g_k \tag{12}
$$

In our training algorithm $\rho_k$ is adjusted based on the
geometric average of (10) and (11), i.e.

$$
\rho_k = \sqrt{R_0(H^{1/2} \Delta w_k)\, R_1^{-1}(H^{-1/2} \Delta g_k)} = \sqrt{a_k / c_k} \tag{13}
$$

Unfortunately, in some optimization problems estimator
(13) tends to vary rapidly. Many fluctuations of $\rho_k$ may
be associated with the imperfect procedure of evaluating
this parameter rather than the real changes of the
eigenvalues and the error surface. In section 2 we
mentioned possible benefits of a slowly varying $\rho_k$ by
pointing out that for constant $\rho_k = \rho$ the convergence
$D_k \to \rho H^{-1}$ is achieved for the sequence of matrices
generated by the updating formula (4). In the SVM
algorithm, we suppress variations of $1/\rho_k$ rather than
$\rho_k$ by implementing a simple low-pass filter using an
exponential average

$$
\rho_k^{-1} = \gamma\, \rho_{k-1}^{-1} + (1 - \gamma)\, \chi \sqrt{c_k / a_k} \tag{14}
$$

where $\gamma$ is the weighting factor ($0 < \gamma < 1$, typically
$\gamma = 0.9$). The $\chi$ scalar in (14) has heuristic roots and is
used to drive $\rho_k$ slightly toward smaller values, which
typically have better convergence properties; in the SVM
algorithm we set $\chi = 1.2$.

Figure 2 displays changes of the scaling factor during the
training procedure. Apparently, in this case the rescaling
of the $D_k$ matrix is not necessary during the very first
stages of training. However, as the training progresses
other values of $\rho_k$, different from 1.0, are more
appropriate.

Figure 2. Changes of the scaling parameter $1/\rho_k$ during
training evaluated according to equation (14). Both
filtered values and raw estimates $\sqrt{c_k / a_k}$ are displayed.
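
The smoothing of the reciprocal scaling factor can be sketched as follows. This is a minimal illustration, assuming the raw estimate of the reciprocal scaling factor is the square root of c_k / a_k, the weighting factor defaults to 0.9, and the heuristic multiplier defaults to 1.2; the function names and the sample (a_k, c_k) pairs are hypothetical.

```python
import math

def raw_inverse_rho(a_k, c_k):
    """Raw estimate sqrt(c_k / a_k) of the reciprocal scaling factor."""
    return math.sqrt(c_k / a_k)

def filtered_inverse_rho(prev_inv_rho, a_k, c_k, gamma=0.9, chi=1.2):
    """One step of the exponential average (14) applied to 1/rho_k."""
    return gamma * prev_inv_rho + (1.0 - gamma) * chi * raw_inverse_rho(a_k, c_k)

# Start from rho = 1 and feed three hypothetical (a_k, c_k) pairs.
inv_rho = 1.0
for a_k, c_k in [(4.0, 1.0), (4.0, 1.0), (1.0, 4.0)]:
    inv_rho = filtered_inverse_rho(inv_rho, a_k, c_k)

rho = 1.0 / inv_rho  # scaling factor passed on to update (4)
```

Even though the last raw estimate jumps by a factor of four, the filtered value moves only moderately, which is the intended damping effect.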


5. Step Length Calculation 


In the variable metric methods, the search direction is
determined from the equation $s_k = -D_k g_k$. Then, the
step length $\alpha_k$ and the weight correction $\Delta w_k = \alpha_k s_k$
along a given ray are established. To avoid expensive
direct line searches, we compute $\alpha_k$ according to the
formula (refs. 7 & 8)

$$
\alpha_k = -\frac{g_k^T s_k}{s_k^T H s_k + \lambda\, s_k^T s_k} \tag{15}
$$

for which the vector $H s_k$ is evaluated using the
RBackprop algorithm (ref. 8). To achieve a more precise
step size adjustment, the $\lambda$ parameter is continuously
tuned during the training process. This process is
performed differently from (ref. 7).

For a descent search direction and a sufficiently large
value of $\lambda$, it should always be possible to find such an
$\alpha_k$ that (15) leads to a reduction of the error.
However, this primary condition ($E(w_{k+1}) < E(w_k)$) for
the step size to be accepted does not assure an efficient
training strategy. Under certain circumstances, the step
size defined by (15) may be either too small, so that the new
point is placed far before the minimum, or too large; in
the latter case the algorithm overshoots the minimum
along the given ray, producing negligible error reduction.
In both situations, the result is an unsatisfactory decrease
of the objective function. For positive curvature
($s_k^T H s_k > 0$) and $E(w)$ represented along the search
direction by the function of one argument
$E(\alpha_k) = E(w_k + \alpha_k s_k)$, a simple criterion against too
long a step size is (refs. 4 & 9)

$$
E(\alpha_k) \le E(0) + \beta \alpha_k E'(0) = E(0) + \beta \alpha_k g_k^T s_k \tag{16}
$$

where $\beta$ ($0 < \beta \le 0.5$) is a fixed parameter. Setting $\beta$
to values smaller than 0.5 relaxes condition (16),
allowing longer steps to be taken. In our training
algorithm we use $\beta = 0.5$.

The SVM algorithm attempts to take full advantage of
the presumed local quadratic nature of $E(\alpha_k)$ by
reducing $\lambda$ as much as possible. The value of $\lambda$ is
decreased by half on every iteration that satisfies (16).
When the condition is violated but a reduction in error
is achieved regardless, the new weight settings are
accepted and the value of $\lambda$ is increased (multiplied
by two) so that in subsequent iterations the step length $\alpha_k$
will tend to be smaller. Finally, in case of failure
($E(w_{k+1}) > E(w_k)$), the value of $\lambda$ is increased more
rapidly (multiplied by four), resulting in shorter step sizes
$\alpha_k$ in the next attempts to minimize the error function.

In the special case when the denominator in (15) is not
positive, $\lambda$ is reset to the new value

$$
\lambda \leftarrow \lambda + 2\,\frac{\left| s_k^T H s_k \right|}{s_k^T s_k} \tag{17}
$$

This rather “desperate” act effectively reverses the
Hessian sign in the step length calculations. Resetting
$\lambda$ according to (17) acts as a safeguard in rare situations
when the local information $H s_k$ and the current value of
$\lambda$ do not provide reliable information on how much the
step size could be extended.

In our algorithm, the $\lambda$ parameter is bounded between
macheps (see footnote 1) and $10^{16}$ for a double precision
implementation. Violation of the upper constraint was
chosen to signal the failure to minimize $E(w)$. It may
be possible that $D_k$ is no longer positive definite due to
round-off errors and the search direction is not a
descent one. In such a case, $D_k$ is reset to the identity
matrix and, consequently, the new search direction
$s_k = -g_k$ is selected.

Note that the training method described here does not
automatically reset $D_k$ to the identity matrix every $N$
iterations as in the cyclic methods ($N$ is the total number
of weights). We argue that, for large optimization
problems, discarding previously acquired information
about the second derivatives in such a predetermined
fashion is often unnecessary. In our method, the $D_k$
matrix is reset when it is evident that its positive definite
property is lost, i.e., $a_k < 0$. In addition, another test is
performed

$$
\frac{b_k}{\|\Delta w_k\|_2^2\, \|\Delta g_k\|_2^2} > \text{macheps} \tag{18}
$$

to avoid updating $D_k$ with two noisy vectors, $\Delta w_k$ and
$\Delta g_k$. Note that (18) assures $b_k > 0$. If condition (16) is
false, the $D_k$ update should be skipped. Furthermore,
when this situation is detected in several consecutive
iterations the Hessian approximation should be reset as
well.
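
The step length rule and the adjustment of the damping parameter described in this section can be sketched as below. This is a hedged illustration, not the authors' code: the function names `step_length` and `adapt_lambda` are hypothetical, the product Hs is assumed to be supplied by the caller (for instance by RBackprop), the reset for a non-positive denominator adds twice the magnitude of the curvature term divided by s's (one reading of (17)), and the sketch leaves the upper-bound check (a damping value above 1e16 signals failure) to the caller.

```python
import sys

MACHEPS = sys.float_info.epsilon  # lower bound for the damping parameter

def step_length(g, s, Hs, lam):
    """Return (alpha, lam) following (15); lam > 0 is reset by (17) if needed."""
    gs = sum(gi * si for gi, si in zip(g, s))     # g's, negative for descent
    sHs = sum(si * hi for si, hi in zip(s, Hs))   # curvature along s
    ss = sum(si * si for si in s)
    if sHs + lam * ss <= 0.0:
        lam = lam + 2.0 * abs(sHs) / ss           # reset (17)
    return -gs / (sHs + lam * ss), lam

def adapt_lambda(lam, e_old, e_new, alpha, gs, beta=0.5):
    """Adjust lam after a trial step of length alpha along s."""
    if e_new > e_old:                             # failure: no error reduction
        lam *= 4.0
    elif e_new <= e_old + beta * alpha * gs:      # slope test (16) satisfied
        lam *= 0.5
    else:                                         # reduced, but (16) violated
        lam *= 2.0
    return max(lam, MACHEPS)

# Positive curvature: H = diag(1, 10), w = (1, 1), s = -g.
alpha1, lam1 = step_length([1.0, 10.0], [-1.0, -10.0], [-1.0, -100.0], 1e-3)
# Negative curvature along s triggers the reset (17).
alpha2, lam2 = step_length([-1.0, 0.0], [1.0, 0.0], [-2.0, 0.0], 0.5)
```

In the second call the denominator after the reset equals the magnitude of the curvature term plus the old damping term, which is what reversing the Hessian sign amounts to.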


6. Predictive Updating 

In the basic quasi-Newton schemes, matrix $D_k$ is always
updated after the actual step from $w_k$ to $w_{k+1}$ is made.
Therefore, the choice of the search direction relies only
on previously acquired information, without precise
knowledge of what could be expected in the next step.
The RBackprop algorithm could be used to partially
compensate for this deficiency by probing desired
directions.
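
As a rough illustration of probing a direction, the sketch below approximates the product of the Hessian with a vector s by a central difference of gradients. This is a stand-in and not the RBackprop algorithm of ref. 8, which obtains the product exactly; the function names, the test matrix, and the step `eps` are assumptions made for the example.

```python
def hessian_vector_fd(grad, w, s, eps=1e-6):
    """Approximate H*s by a central difference of gradients along s."""
    w_plus = [wi + eps * si for wi, si in zip(w, s)]
    w_minus = [wi - eps * si for wi, si in zip(w, s)]
    g_plus, g_minus = grad(w_plus), grad(w_minus)
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(g_plus, g_minus)]

# Quadratic test function E(w) = 0.5 * w'Hw + b'w with a known Hessian.
H = [[2.0, 1.0], [1.0, 3.0]]
b = [1.0, -1.0]

def grad(w):
    """Gradient of the test function: H*w + b."""
    return [sum(H[i][j] * w[j] for j in range(2)) + b[i] for i in range(2)]

# Probe the direction s = (1, 2) at the point w = (0.5, -0.5).
Hs = hessian_vector_fd(grad, [0.5, -0.5], [1.0, 2.0])
```

For a quadratic function the difference quotient reproduces the exact product up to rounding; for a neural network error function it is only an approximation, which is one reason an exact method is attractive.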


1 For floating point computations, the machine precision
(macheps) may be defined as the smallest value such that 1.0 and
1.0 + macheps have different representations in the computer
memory.
Table 1. Final training errors and average CPU time used to complete a single training task.

Run          SVM           LM            BFGS          BFGS (limited memory)   RPROP
1            [illegible]   [illegible]   [illegible]   [illegible]             6.606e-03
2            [illegible]   [illegible]   [illegible]   [illegible]             6.691e-03
3            4.201e-03     4.286e-03     4.975e-03     5.347e-03               6.790e-03
4            4.299e-03     4.423e-03     5.351e-03     5.706e-03               6.702e-03
5            4.180e-03     3.987e-03     5.051e-03     5.157e-03               6.644e-03
6            4.065e-03     [illegible]   5.170e-03     5.220e-03               7.147e-03
7            4.126e-03     [illegible]   5.015e-03     5.297e-03               6.218e-03
8            3.906e-03     3.809e-03     5.123e-03     5.376e-03               6.670e-03
9            3.762e-03     3.992e-03     5.053e-03     5.391e-03               6.802e-03
10           4.389e-03     4.477e-03     5.269e-03     5.428e-03               6.667e-03
Iterations   1000          200           1000          1000                    1000
CPU time     25 min        14 h 40 min   26 min        22 min                  5 min


The idea of predictive updating is not complicated. A
search direction that is computed as in the traditional
variable metric algorithms serves as the first
approximation of the final search ray. Along this
direction, information about second derivatives is
collected using the RBackprop method. This information
is used to update the $D_k$ matrix one more time before
calculating the final search direction.

The initial guess for the search direction is therefore
$s_k = -D_k g_k$. In a small neighborhood of the current
point $w_k$, the error function can be approximated by a
quadratic model

$$
E(w_k) = c + b^T w_k + \frac{1}{2} w_k^T H w_k \tag{19}
$$

where $b$ is such a vector that $\partial E(w_k)/\partial w = b + H w_k$ and
$c$ is constant. Operating on (19), for some suitable $\varepsilon$ and
$\Delta w = \varepsilon s_k$ we can write

$$
\left.\frac{\partial E}{\partial w}\right|_{w = w_k + \varepsilon s_k}
- \left.\frac{\partial E}{\partial w}\right|_{w = w_k}
= \Delta g = \varepsilon H s_k \tag{20}
$$

Relationship (20) suggests that the $D_k$ matrix may be
updated using $\varepsilon H s_k$ and $\varepsilon s_k$ in place of $\Delta g$ and
$\Delta w$ in the equation defining the extended BFGS formula. Note
that the $\varepsilon$ scalar cancels in this update. This yields
the new matrix $\tilde{D}_k$, which can be used to compute the final
search direction $\tilde{s}_k = -\tilde{D}_k g_k$.

Employing a supplementary, predictive update of the
$D_k$ matrix involves some risks when the error function is
highly non-quadratic. Our tests showed, however, a
positive impact of the predictive update in practical
problems. In the majority of tests, the convergence was
faster and the final results were better in comparison to
the method that used a single BFGS update per epoch.

Computer implementation of the predictive updating is
not complicated, as the additional program code is very
similar to the main loop in the extended BFGS routine.
Moreover, since the network output and the gradient
were already evaluated for $w_k$, the product $H s_k$ may be
obtained relatively inexpensively for any $s_k$ (at the cost
of one additional feedforward and one backward pass)
using the RBackprop algorithm.


7. Efficient Update Implementation


It is not unusual for feedforward neural networks to
incorporate several hundred adjustable parameters. For
variable metric methods, this translates into increased
costs for periodic updating of the $D_k$ matrix. In such
cases it may be worthwhile to convert (4) into a formula
that clearly looks like a rank two, symmetric update

$$
D_{k+1} = D_k + u_k u_k^T - v_k v_k^T \tag{21}
$$

and utilize it in the updating algorithm. In equation (21)
vectors $v_k$ and $u_k$ are expressed by

$$
v_k = \frac{D_k \Delta g_k}{\sqrt{a_k + \rho\, b_k}}, \qquad
u_k = \frac{\sqrt{a_k + \rho\, b_k}}{b_k}\, \Delta w_k - v_k \tag{22}
$$

Obviously, when updating the $D_k$ matrix, its symmetric
property should also be exploited to avoid redundant
calculations with respect to either the lower or the upper
triangular part. Below, pseudo code is presented for executing the
extended BFGS variable metric update. Using this
scheme requires three times fewer multiplications in the
main routine loop in comparison to the procedure which
implements (4) directly.

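    v = D * dg;                 /* dw, dg: weight and gradient changes */
    a = dg^T * v;               /* a_k in (4) */
    b = dw^T * dg;              /* b_k in (4) */
    if (a > 0.0 AND
        b > macheps * (dw^T * dw) * (dg^T * dg))    /* test (18) */
    {
        s = sqrt(a + ro * b);   /* ro: scaling factor */
        v = v / s;              /* v_k in (22) */
        u = s / b * dw - v;     /* u_k in (22) */
        for (i = 0; i < n; i++)
        {
            ui = u[i];
            vi = v[i];
            D[i][i] += ui * ui - vi * vi;
            for (j = i + 1; j < n; j++)
            {
                /* update the upper triangle and mirror it */
                D[j][i] = (D[i][j] += ui * u[j] - vi * v[j]);
            }
        }
    }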
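
A runnable counterpart of the pseudo code above might look like the following sketch. It uses plain Python lists, takes `sys.float_info.epsilon` as macheps, and the function name `rank_two_update` and the sample data are assumptions made for illustration. The last lines check that the updated matrix maps the gradient change onto the scaled weight change, as condition (2) requires.

```python
import math
import sys

def rank_two_update(D, dw, dg, rho=1.0):
    """Update the symmetric matrix D in place using (21) and (22).

    Returns True if the update was applied and False if it was skipped.
    """
    n = len(dw)
    v = [sum(D[i][j] * dg[j] for j in range(n)) for i in range(n)]  # D * dg
    a = sum(dg[i] * v[i] for i in range(n))
    b = sum(dw[i] * dg[i] for i in range(n))
    ww = sum(x * x for x in dw)
    gg = sum(x * x for x in dg)
    if not (a > 0.0 and b > sys.float_info.epsilon * ww * gg):
        return False                      # skip the update with unreliable data
    s = math.sqrt(a + rho * b)
    v = [vi / s for vi in v]
    u = [s / b * dw[i] - v[i] for i in range(n)]
    for i in range(n):
        D[i][i] += u[i] * u[i] - v[i] * v[i]
        for j in range(i + 1, n):         # upper triangle, mirrored below
            D[i][j] += u[i] * u[j] - v[i] * v[j]
            D[j][i] = D[i][j]
    return True

D = [[2.0, 0.5], [0.5, 1.0]]
dw, dg, rho = [0.5, -0.25], [1.0, 0.5], 2.0
applied = rank_two_update(D, dw, dg, rho)
secant = [sum(D[i][j] * dg[j] for j in range(2)) for i in range(2)]
skipped = rank_two_update([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0], [-1.0, 0.0])
```

Only the upper triangle is computed in the inner loop; the lower triangle is filled by copying, which is the saving the text refers to.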

8. Numerical Experiments 

Numerical experiments have been carried out to test the 
performance of the scaled variable metric method 
against other selected training techniques which are 
known to be effective and frequently used in practice. 
The following four training algorithms were chosen for
the comparison: (i) Levenberg-Marquardt (LM)
algorithm with the predicted error reduction (ref. 9),
(ii) standard BFGS method (ref. 2), (iii) limited memory
BFGS (ref. 9), and (iv) RPROP (ref. 10). Both BFGS
methods employed Brent’s line search (ref. 11). A
common updating routine of the $D_k$ matrix was
implemented in the identical fashion in the SVM method
as in other variable metric algorithms. All numerical
tests were performed on a Pentium 200 MHz personal 
computer. 

The performance of the SVM method was
verified experimentally by training ten 12-18-16-6
feedforward neural networks (12 input sensors, 6
outputs, 640 weights) using 1373 preprocessed data
points acquired from the calibration process of a
six-component wind tunnel strain-gage. Each of the ten
training experiments used the same starting point for all
the algorithms. Figure 3 presents the convergence curves
of the five training algorithms. Clearly, the simplest 
RPROP algorithm exhibited the lowest convergence 
rate. The standard BFGS method and its limited memory 


version demonstrated a better performance but the SVM 
algorithm was superior to these techniques. On average, 
our algorithm was able to reach the error level of the 
standard BFGS method in 250-350 iterations. The
steepest convergence curve belonged to the
Levenberg-Marquardt algorithm. However, since the neural model
had multiple outputs, the Levenberg-Marquardt training
was substantially slower than other methods, requiring
14 hours and 40 minutes to iterate 200 epochs. The SVM
method (run for 1000 epochs) was able to surpass the
Levenberg-Marquardt results in most cases, requiring
only 25 minutes on average to complete training. The
RPROP method was the fastest training technique in our
comparison. It is important to note, however, that for the
given problem, the RPROP algorithm was the least accurate
of all the methods under consideration, even when the
number of iterations was increased to 5000. In this case,
execution time was approximately 25 minutes.
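
The weight count quoted for the 12-18-16-6 architecture can be checked with a few lines of arithmetic. The sketch below assumes a fully connected network with one bias per non-input unit, an assumption that reproduces the stated total; the function name `count_weights` is hypothetical.

```python
def count_weights(layers):
    """Number of adjustable weights, assuming one bias per non-input unit."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))

# 13*18 + 19*16 + 17*6 = 234 + 304 + 102
n_weights = count_weights([12, 18, 16, 6])
```

With this count, a full inverse Hessian approximation has 640 by 640 entries, which is why the cost of each update of the matrix matters.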


Figure 3. Convergence curves of different training
algorithms (vertical axis: training error).

9. Conclusions

In this report, we have presented a new scaled variable 
metric (SVM) method for training feedforward neural 
networks. The SVM technique utilizes the RBackprop 
algorithm (ref. 8) and the modified variable metric 
update, derived as a subclass of the Huang family 
formulae. The variable metric method is used to collect 
information about second derivatives of the error 
function with respect to network weights. It allows us to 
construct a relatively efficient strategy for choosing the 
sequence of search directions. The variable metric
update was extended by an additional parameter $\rho_k$,
which plays a fundamental role in attempts to accelerate
the convergence speed of the training procedure. We
have shown that a special case of the Huang updating
formulae can be efficiently combined with the
variable metric algorithm (ref. 6) to form an efficient
training scheme. We have presented a new strategy for
setting the scaling factor, which emphasizes the
importance of limiting unnecessary fluctuations of $\rho_k$.

The RBackprop algorithm can be utilized in two ways:
it allows us to avoid expensive one-directional line
searches, and it can be used in supplementary, predictive
updating of the $D_k$ matrix. Predictive updating acquires
information about second derivatives along the trial
search direction before committing to the final step,
which can then be chosen more precisely. In addition, matrix
$D_k$ may benefit from the predictive updates by being able
to track changes of the Hessian matrix faster. We have
employed a modified strategy for adjusting the $\lambda$
parameter in the step length calculations using
directional slope testing. We have found that this
technique works exceptionally well in conjunction with
the RBackprop algorithm. Finally, we have outlined a
computationally more efficient scheme for updating the
$D_k$ matrix. This is a somewhat overlooked aspect of
many variable metric implementations but it becomes a
rather important issue when large neural networks are
being trained.

Numerical experiments provide evidence that the
theoretical background developed to estimate an
appropriate range of settings for the $\rho_k$ scalar indeed
works in practice, although the same theory indicates
that acceleration may not always be possible. For some
problems, where $\rho_k \approx 1$ is a suitable choice, the standard
BFGS method may be superior. In practical situations,
however, the SVM algorithm may protect the variable
metric algorithm from being stuck. The method also has
the capacity to perform a continuous adjustment of the
$\rho_k$ parameter rather than rescale the initial matrix $D_0$ only
once. Interestingly, in some cases the initial scaling of
the $D_0$ matrix is not necessary, but later adjustment of the
$\rho_k$ parameter improves the training convergence.

It seems that information provided by the RBackprop
algorithm and inferred from gradient vectors may be
more efficiently utilized in choosing the sequence of
search directions when the $\rho_k$ scalar is allowed to be
tuned on-line. Our experience with the SVM method is
that its efficiency becomes evident when the size of the
neural architecture increases and/or achieving a low
error becomes difficult, perhaps due to the highly
non-quadratic nature of the optimization problem.